* Simon Ustoyev
* Elina Azrilyan
* Jack Russo
* Anil Akyildirim
For our final project, we perform exploratory analysis on Hillary Clinton's email dataset: network analysis of the email exchanges between contacts to compute centrality measures, and text analysis of the subject lines and body text leveraging NLTK. Our goal is to identify her most important connections, the most frequently used words in the subjects and bodies of the emails, and the topic categories they fall into.
Hillary Clinton's email documents were released by the State Department as PDFs; Kaggle cleaned and normalized these documents and released them for public analysis. The dataset can be found here: https://www.kaggle.com/kaggle/hillary-clinton-emails
The dataset consists of four CSV files. Aliases.csv gives a unique identifier for internal reference, the alias (the text in the From/To field that refers to a person), and the person ID. EmailReceivers.csv also includes a unique identifier for internal reference, along with the IDs of the email and of the receiving person. Emails.csv gives the From/To information, subject text, body text, and document number. Persons.csv provides the unique identifier and the name of each person.
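To make the relationships between the files concrete, here is a minimal sketch of how EmailReceivers.csv joins to Persons.csv, using toy rows with the same column names (the values are invented for illustration):

```python
import pandas as pd

# Toy rows mimicking the Kaggle schema (column names are the real ones;
# the row values here are invented for illustration).
receivers = pd.DataFrame({"Id": [1, 2], "EmailId": [1, 1], "PersonId": [80, 87]})
persons = pd.DataFrame({"Id": [80, 87], "Name": ["Hillary Clinton", "Huma Abedin"]})

# Resolve each PersonId in EmailReceivers to a name via Persons.
resolved = receivers.merge(persons, left_on="PersonId", right_on="Id",
                           suffixes=("_receiver", "_person"))
print(resolved[["EmailId", "Name"]])
```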
# Load Packages
# we attempted a variety of topic modeling approaches, including LDA
import matplotlib.pyplot as plt
import networkx as nx
import pandas as pd
import numpy as np
from pylab import rcParams
import networkx.algorithms.bipartite as bi
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
import plotly
from wordcloud import WordCloud, STOPWORDS
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer
import seaborn as sns
import random
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import wordcloud
import plotly.graph_objs as go
import string
import re
import gensim
from nltk.stem.porter import PorterStemmer
from gensim import corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
from nltk import RegexpTokenizer
from gensim.models import LsiModel
from gensim import matutils
from gensim.models.ldamodel import LdaModel
import pyLDAvis
import pyLDAvis.gensim
#import plotly.io as pio
#from IPython.display import Image
#pio.renderers.default = "svg"
import warnings
warnings.filterwarnings('ignore')
We start by exploring EmailReceivers.csv to build a network and look at the relationships between the contacts.
# load the data
df = pd.read_csv("EmailReceivers.csv")
df.head()
dfp = pd.read_csv("Persons.csv")
dfp.head()
# Tail Emails DataFrame
df_emails = pd.read_csv("Emails.csv")
df_emails.tail()
We will do some preliminary data exploration.
# number of email receivers [affiliations]
print('Number of Email Receiver entries: ', len(df))
We see that there are 9,306 email receiver entries.
# data preperation for network analysis
df['EmailId'] = 'E' + df['EmailId'].astype(str) # Prepend EmailId with 'E'
df['PersonId'] = 'P' + df['PersonId'].astype(str) # Prepend PersonId with 'P'
df.head()
# Creating Graph object
B = nx.from_pandas_edgelist(df, source='PersonId', target='EmailId')
# Check if Graph is 'bipartite'
bi.is_bipartite(B)
# Get list of "Person ID" nodes
idp = df['PersonId'].unique()
# Get list of "Email ID" nodes
ide = df['EmailId'].unique()
# Create and draw a projected graph for person-to-person network
pG = bi.projected_graph(B, idp)
plt.figure(figsize=(40, 40))
pos = nx.spring_layout(pG, k = 0.5)
edg = nx.draw_networkx_edges(pG, pos, width = 3, alpha = 0.2)
pnd = nx.draw_networkx_nodes(pG, pos, node_color = "#9f9fff", node_size = 2000)
lbl = nx.draw_networkx_labels(pG, pos, font_size = 12)
ttl = plt.title('Person-to-person (unweighted) co-affiliation network', loc = 'left', fontsize = 36)
As expected in the person-to-person (unweighted) co-affiliation network, most people are densely connected in the middle, with connections thinning out as the circle expands.
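The projection step can be illustrated on a toy bipartite graph (node names invented): people who appear on the same email become linked, and the weighted projection counts how many emails each pair shares.

```python
import networkx as nx
from networkx.algorithms import bipartite as bi

# Toy bipartite graph: P1 and P2 share email E1; P1 and P3 share email E2.
B = nx.Graph()
B.add_edges_from([("P1", "E1"), ("P2", "E1"), ("P1", "E2"), ("P3", "E2")])

# Project onto the person nodes; the weighted variant counts shared emails.
W = bi.weighted_projected_graph(B, ["P1", "P2", "P3"])
print(list(W.edges(data=True)))
```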
# Create weighted co-affiliation network
wpG = bi.weighted_projected_graph(B, idp)  # project the bipartite graph B, not the projection pG
# Get weights
w = np.array([wpG.edges[e]['weight'] for e in wpG.edges])
# Create figure
plt.figure(figsize=(40,40))
# Calculate layout
pos = nx.spring_layout(wpG, k = 1.5)
# Draw edges, nodes, and labels
edg = nx.draw_networkx_edges(wpG, pos, width = 3, alpha = 0.8, edge_color = w, edge_cmap=plt.cm.Blues)
pnd = nx.draw_networkx_nodes(wpG, pos, node_color = "#9f9fff", node_size = 2000)
lbl = nx.draw_networkx_labels(wpG, pos)
ttl = plt.title('Person-to-person (weighted) co-affiliation network', loc = 'left', fontsize = 36)
degree = pd.DataFrame.from_dict(dict(nx.degree(wpG)), orient='index', columns=['Degree'])
degree_centrality = pd.DataFrame.from_dict(nx.degree_centrality(wpG), orient='index', columns=['Degree_Centrality'])
eigenvector_centrality = pd.DataFrame.from_dict(nx.eigenvector_centrality(wpG), orient='index', columns=['Eigenvector_Centrality'])
closeness_centrality = pd.DataFrame.from_dict(nx.closeness_centrality(wpG), orient='index', columns=['Closeness Centrality'])
betweenness_centrality = pd.DataFrame.from_dict(nx.betweenness_centrality(wpG), orient='index', columns=['Betweenness Centrality'])
dfs = [degree,degree_centrality,eigenvector_centrality,closeness_centrality,betweenness_centrality]
# pandas removed the join_axes argument; concat all metrics and align on the degree index
metrics = pd.concat(dfs, axis=1).reindex(degree.index)
metrics.sort_values(by=['Degree_Centrality', 'Eigenvector_Centrality'], ascending=False).head(10)
Looking at the centrality measures, we see that contacts P81, P80, P272, P87, and P180 have the highest number of neighbours and are the most frequently interacting connections in the emails. We will look at who these people are shortly. Let's further create some islands to see the connections more closely.
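As a quick refresher on what these measures capture, a star graph makes the extremes obvious: the hub touches everyone (degree centrality 1.0) and sits on every shortest path between the leaves (betweenness centrality 1.0).

```python
import networkx as nx

# Star graph: node 0 touches every other node; the leaves touch only node 0.
G = nx.star_graph(4)  # nodes 0..4, edges 0-1, 0-2, 0-3, 0-4

dc = nx.degree_centrality(G)       # degree divided by (n - 1)
bc = nx.betweenness_centrality(G)  # fraction of shortest paths passing through a node
print(dc[0], bc[0])
```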
'''
The 'trim_edges' function below takes a graph and applies a threshold (weight),
letting all edges above a certain value through and removing all others.
'''
def trim_edges(g, weight=1):
    g2 = nx.Graph()
    for f, to, edata in g.edges(data=True):
        if edata['weight'] > weight:
            g2.add_edge(f, to, weight=edata['weight'])
    return g2
'''
The 'island_method' function below computes evenly spaced thresholds
and produces a list of networks at each threshold level.
'''
def island_method(g, iterations=5):
    weights = [edata['weight'] for f, to, edata in g.edges(data=True)]
    mn = int(min(weights))
    mx = int(max(weights))
    # compute the size of the step, so we get a reasonable step in iterations
    step = max(int((mx - mn) / iterations), 1)  # guard against a zero step
    return [[threshold, trim_edges(g, threshold)] for threshold in range(mn, mx, step)]
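To see the trimming concretely, here is the same trim_edges logic (restated so the snippet runs standalone) applied to a toy weighted graph; only edges strictly heavier than the threshold survive.

```python
import networkx as nx

def trim_edges(g, weight=1):
    # Keep only edges strictly heavier than the threshold (same logic as above).
    g2 = nx.Graph()
    for f, to, edata in g.edges(data=True):
        if edata['weight'] > weight:
            g2.add_edge(f, to, weight=edata['weight'])
    return g2

# Toy weighted graph (invented edges).
G = nx.Graph()
G.add_edge("A", "B", weight=5)
G.add_edge("B", "C", weight=1)
G.add_edge("C", "D", weight=3)

trimmed = trim_edges(G, weight=2)
print(sorted(trimmed.edges()))
```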
# This function will draw a "Person-Node" graph
def plot_person_node_graph(g, fig_sz=(40, 40), node_sz=2000, spring_layout_k=0.5):
    plt.figure(figsize=fig_sz)
    pos = nx.spring_layout(g, k=spring_layout_k)
    edg = nx.draw_networkx_edges(g, pos, width=3, alpha=0.2)
    pnd = nx.draw_networkx_nodes(g, pos, node_color="#9f9fff", node_size=node_sz)
    lbl = nx.draw_networkx_labels(g, pos, font_size=12)
islands = island_method(wpG)
for i in islands:
    # print the threshold level, size of the graph, and number of connected components
    print(i[0], len(i[1]), len(list(nx.connected_components(i[1]))))
# Graph of islands [8 9 1]
ig = islands[1][1]
plot_person_node_graph(ig, fig_sz=(20,20))
ttl = plt.title('Person-to-person networks with weight > 8', loc = 'left', fontsize = 36)
We can see that P87 is connected with P80, P32, P170, and P81. P124 and P272 sit outside the island and are connected only to each other.
# Graph of islands [22 4 1]
ig = islands[3][1]
plot_person_node_graph(ig, fig_sz=(20,20))
ttl = plt.title('Person-to-person networks with weight > 22', loc = 'left', fontsize = 36)
Looking at the connections with weight above 22, we see a much smaller triangle. P80 (possibly Hillary Clinton) is in the center and is most strongly connected to P87, P32, and P81. Now, let's look at who these people are.
P81 = dfp.query('Id == 81')
P80 = dfp.query('Id == 80')
P272 = dfp.query('Id == 272')
P87 = dfp.query('Id == 87')
P180 = dfp.query('Id == 180')
P32 = dfp.query('Id == 32')
people_of_interest = [P81, P80, P272, P87, P180, P32]
people_of_interest = pd.concat(people_of_interest)
people_of_interest
We can see that the most important people Hillary Clinton is connected to in the email contacts, besides herself, are Huma Abedin, Jake Sullivan, Richard Verma, and Cheryl Mills.
Now let's start examining the emails. First, let's do a little cleanup of the df_emails dataframe we created earlier.
# adjust the emails dataset to split the timestamp into date and time
timeStamp_split = df_emails['MetadataDateSent'].str.split("T")
Times = timeStamp_split.str[1]
df_emails['Dates'] = timeStamp_split.str[0]
df_emails['Times'] = Times.str.split("+").str[0]
df_emails['Dates'] = pd.Series(df_emails['Dates'])
df_emails['Times'] = pd.Series(df_emails['Times'])
df_emails_new = df_emails[["MetadataSubject", "ExtractedBodyText", "MetadataTo", "MetadataFrom","Times","Dates"]]
df_emails_new.columns=["Subject", "BodyText", "To", "From", "Times", "Dates"]
df_emails_new.drop(columns=['Times'], inplace=True)  # we don't need times, only dates
df_emails_new.head()
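The string splitting above can also be done with pandas' datetime parsing; a minimal sketch, assuming MetadataDateSent holds ISO-8601 timestamps like the invented sample below:

```python
import pandas as pd

# One invented row in the same shape as MetadataDateSent.
sample = pd.DataFrame({"MetadataDateSent": ["2009-05-01T12:30:00+00:00"]})

# Parse once, then format out just the date part.
sent = pd.to_datetime(sample["MetadataDateSent"])
sample["Dates"] = sent.dt.strftime("%Y-%m-%d")
print(sample["Dates"].iloc[0])
```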
len(df_emails_new)
We see that there are a total of 7,945 emails. Let's look at the top 20 people in the emails.
df_emails_new['To'].value_counts()[0:20]  # top 20 people emails were sent to (excluding Hillary)
Based on the emails, we can see that the top 20 recipients map to the top connections we saw in the network analysis.
df_emails_new['From'].value_counts()[0:20]  # top 20 senders (excluding Hillary)
We can see the same pattern among the senders.
Let's further look at the times these emails were generated, to see when the communication was heaviest.
countDates = df_emails_new.Dates.groupby([df_emails_new.Dates]).agg(['count'])
type(countDates)
list(countDates)
countDates2= [go.Scatter(x = list(countDates.index), y=countDates['count'])]
layout = dict(title = 'Count mails per day',
xaxis= dict(title= 'year',ticklen= 10,zeroline= False)
)
fig = dict(data = countDates2, layout = layout)
plotly.offline.iplot(fig)
#Image(pio.to_image(fig, format='png'))

We see that the communication between the contacts is heavy between 2009 and 2011, drops during 2012, and picks up again toward the end of 2012.
Let's analyze the text in the subject and body of the emails. We will look at word frequency and the words used in both the subject and body text.
# create a function that can plot the wordcloud
def show_wordcloud_1(data, title):
    text = ' '.join(data['BodyText'].astype(str).tolist())
    stopwords = set(wordcloud.STOPWORDS)  # we need to set the stopwords
    fig_wordcloud = wordcloud.WordCloud(stopwords=stopwords, background_color='lightgrey',
                                        colormap='viridis', width=800, height=600).generate(text)
    plt.figure(figsize=(15, 10), frameon=True)
    plt.imshow(fig_wordcloud)
    plt.axis('off')
    plt.title(title, fontsize=20)
    plt.show()
show_wordcloud_1(df_emails_new, "Body Text")
When we look at the body text, we see words and phrases such as "White House", "Secretary of State", "gov", "now", and "the President".
# create a function that can plot the wordcloud
def show_wordcloud_2(data, title):
    text = ' '.join(data['Subject'].astype(str).tolist())
    stopwords = set(wordcloud.STOPWORDS)  # we need to set the stopwords
    fig_wordcloud = wordcloud.WordCloud(stopwords=stopwords, background_color='lightgrey',
                                        colormap='viridis', width=800, height=600).generate(text)
    plt.figure(figsize=(15, 10), frameon=True)
    plt.imshow(fig_wordcloud)
    plt.axis('off')
    plt.title(title, fontsize=20)
    plt.show()
show_wordcloud_2(df_emails_new, "Subject")
Looking at the subject text, "Reuters", "Call List", "Thank You", "on", and "speech" are some of the words that stand out.
# to remove punctuation and stopwords we need to convert the dtype to string
df_emails_new["BodyText"]=df_emails_new["BodyText"].astype("str")
df_emails_new["Subject"]=df_emails_new["Subject"].astype("str")
# create a function to remove punctuation and stop words
def remove_punctuation_and_stopwords(text):
    text_no_punctuation = [ch for ch in text if ch not in string.punctuation]
    text_no_punctuation = "".join(text_no_punctuation).split()
    text_no_punctuation_no_stopwords = \
        [word.lower() for word in text_no_punctuation if word.lower() not in stopwords.words("english")]
    return text_no_punctuation_no_stopwords
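The same cleanup idea can be sketched with str.translate and a small inline stopword list, so it runs without the NLTK stopword download (the stopword set below is a tiny illustrative subset, not NLTK's list):

```python
import string

# A tiny illustrative stopword subset (NLTK's English list is much longer).
STOP = {"the", "to", "of", "a", "and", "in"}

def clean(text):
    # Strip punctuation in one pass, then lowercase and drop stopwords.
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    return [w.lower() for w in no_punct.split() if w.lower() not in STOP]

print(clean("Meet the Secretary of State, tomorrow!"))
```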
# create the list of subject and body text words
df_emails_new.loc[:, 'BodyText'] = df_emails_new['BodyText'].apply(remove_punctuation_and_stopwords)
words_body_text = df_emails_new['BodyText'].tolist()
df_emails_new.loc[:, 'Subject'] = df_emails_new['Subject'].apply(remove_punctuation_and_stopwords)
words_subject_text = df_emails_new['Subject'].tolist()
list_body_words = []
for sublist in words_body_text:
    for item in sublist:
        list_body_words.append(item)
list_subject_words = []
for sublist in words_subject_text:
    for item in sublist:
        list_subject_words.append(item)
# frequency Dist
fdist_body = nltk.FreqDist(list_body_words)
fdist_subject = nltk.FreqDist(list_subject_words)
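nltk.FreqDist is essentially a Counter over tokens, so the top-N tables could equally be built with collections.Counter; a toy example:

```python
from collections import Counter

# Invented token list standing in for list_body_words.
tokens = ["state", "call", "state", "pm", "state", "call"]

freq = Counter(tokens)
print(freq.most_common(2))  # the two most frequent tokens with their counts
```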
# top 30 dataframe
df_body_top30_nltk = pd.DataFrame(fdist_body.most_common(30), columns=['word', 'count'])
df_subject_top30_nltk = pd.DataFrame(fdist_subject.most_common(30), columns=['word', 'count'])
# plot
fig, ax = plt.subplots(figsize=(10, 6))
sns.barplot(x='word', y='count',
data=df_body_top30_nltk, ax=ax)
plt.title("Top 30 Body words")
plt.xticks(rotation='vertical');
Based on the frequency distribution, we see that the top 30 body words include "pm", "us", "state", "would", "said", and, interestingly, "ran" and "office".
fig, ax = plt.subplots(figsize=(10, 6))
sns.barplot(x='word', y='count',
data=df_subject_top30_nltk, ax=ax)
plt.title("Top 30 Subject words")
plt.xticks(rotation='vertical');
Among the top 30 most frequently used words in the subject lines are "call", "schedule", "speech", and "update".
We analyzed the emails and found the relationships between the contacts, along with the people most communicated with. We also looked at the most frequently used words in the subjects and bodies of the emails. Let's further see whether we can extract the topics in the subject lines of the emails.
data = df_emails_new[pd.notnull(df_emails_new['Subject'])]
data["Subject"]=data["Subject"].astype("str")
tokenizer = RegexpTokenizer(r'\w+')
texts = [tokenizer.tokenize(email.lower()) for email in data['Subject']]
def delete_stopwords(tokenized_sentence: list):
    return list(filter(lambda x: x not in stop_words, tokenized_sentence))
stop_words = set(stopwords.words('english'))
texts = list(filter(lambda x: len(x) > 5, [delete_stopwords(text) for text in texts]))
corpora_dict = corpora.Dictionary(texts)
#corpora_dict.filter_extremes(no_below=5, no_above=0.5)
corpus = [corpora_dict.doc2bow(text) for text in texts]
print(corpus[:1])
# Human readable format of corpus (term-frequency)
[[(corpora_dict[id], freq) for id, freq in cp] for cp in corpus[:1]]
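What doc2bow produces can be sketched in plain Python: each token gets an integer id, and each document becomes a list of (id, count) pairs (the toy documents below are invented):

```python
from collections import Counter

# Two toy tokenized documents.
texts = [["call", "schedule", "call"], ["speech", "call"]]

# Assign each new token the next integer id, like corpora.Dictionary does.
vocab = {}
for doc in texts:
    for tok in doc:
        vocab.setdefault(tok, len(vocab))

# Each document becomes sorted (token_id, frequency) pairs, like doc2bow.
bow = [sorted(Counter(vocab[tok] for tok in doc).items()) for doc in texts]
print(bow)
```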
model_lda = LdaModel(corpus, passes=20, num_topics=10, id2word=corpora_dict)
str_topics = [topic_w for topic_number, topic_w in model_lda.print_topics()]
str_topics_split = list(map(lambda x: x.split("+"), str_topics))
str_topics_split = [list(map(lambda x: x.split("*")[1].strip()[1:-1], elem)) for elem in str_topics_split]
for topic in str_topics_split:
    print(topic)
data_lda = pyLDAvis.gensim.prepare(model_lda, corpus, corpora_dict)
pyLDAvis.enable_notebook()
pyLDAvis.display(data_lda)
We performed exploratory analysis on Hillary Clinton's email dataset and drew interesting insights from the data. We plotted the weighted and unweighted networks and identified the top 10 people with the highest degree centrality. Island graphs at the highest weights confirmed the most important individuals in the dataset. We also identified the contacts sending and receiving the largest numbers of emails, and the timeframe during which the majority of emails were sent (2009-2011). We then analyzed the body and subject text of the emails, creating word clouds and histograms to identify the most frequently used words. Finally, we explored the topics in the subject lines and created an interactive graph of topics and the relevant terms for each. Originally we were concerned that the network graphs would not be very insightful, since this is an email network centered on Hillary Clinton, but that was not an issue: we were able to point out the highest-influence individuals in her circle. Nothing in our text analysis revealed any dubious words, themes, or interactions.